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ABSTRACT 


With  the  availability  of  very  large,  relatively  inexpensive  main  memories,  it  is  becoming  possi¬ 
ble  keep  large  databases  resident  in  main  memory.  In  this  paper  we  consider  the  changes  necessary  to 
permit  a  relational  database  system  to  take  advantage  of  large  amounts  of  main  memory.  We  evaluate 
AVL  vs.  B+  -tree  access  methods  for  main  memory  databases,  hash-based  query  processing  strategies  vs. 
sort-merge,  and  study  recovery  issues  when  most  or  all  of  the  database  fits  in  main  memory.  As  expected. 
B+  -trees  are  the  preferred  storage  mechanism  unless  more  than  80-90%  of  the  database  fits  in  main 
memory.  A  somewhat  surprising  result  is  that  hash  based  query  processing  strategies  are  advantageous 
for  large  memory  situations. 

Key  Words  and  Phrases:  Main  Memory  Databases,  Access  Methods,  Join  Algorithms,  Access  Planning, 
Recovery  Mechanisms 


1.  Introduction 


Throughout  the  past  decade  main  memory  prices  have  plummetted  and  are  expected  to  con¬ 
tinue  to  do  so.  At  the  present  time,  memory  for  super-minicomputers  such  as  the  VAX  11/780  costs 
approximately  $1,500  a  megabyte.  By  1900,  1  megabit  memory  chips  will  be  commonplace  and  should 
further  reduce  prices  by  another  order  of  magnitude.  Thus,  in  1990  a  gigabyte  of  memory  should  cost  less 
than  $2 00,000.  If  4  megabit  memory  chips  are  available,  the  price  might  be  as  low  as  $50,000. 

With  the  availability  of  larger  amounts  of  main  memory,  it  becomes  possible  to  contemplate 
the  storage  of  databases  as  main  memory  objects.  In  fact,  IMS  Fast  Path  [DATE82]  has  supported  such 
databases  for  some  time.  In  this  paper  we  consider  the  changes  that  might  be  needed  to  a  relational  data¬ 
base  system  if  most  (or  all)  of  a  relation(s)  is  (are)  resident  in  main  memory. 

In  Section  2,  the  performance  of  alternative  access  methods  for  main  memory  database  sys¬ 
tems  are  considered.  Algorithms  for  relational  database  operators  in  this  environment  are  presented  and 
evaluated  in  Section  3.  In  Section  4,  we  describe  how  access  planning  will  be  affected  by  the  availability 
of  large  amounts  of  main  memory  for  query  processing.  Section  5  discusses  recovery  in  memory  resident 
databases.  Our  conclusions  and  suggestions  for  future  research  are  contained  in  Section  6. 

2.  Access  Methods  for  Memory  Resident  Databases 

The  standard  access  method  for  data  on  disk  is  the  B+-tree  [COME79],  providing  both  ran¬ 
dom  and  sequential  key  access.  A  B+ -tree  is  specially  designed  to  provide  fast  access  to  disk-resident 
data  and  makes  fundamental  use  of  the  page  size  of  the  device.  On  the  other  hand,  if  a  keyed  relation  is 
known  to  reside  in  main  memory,  then  an  AVL  (or  other  binary)  tree  organization  may  be  a  better  choice 
In  this  section  we  analyze  the  performance  of  both  structure  for  a  relation  R  with  the  following  charac¬ 
teristics: 

||R||  number  of  tuples  in  relation  R 
K  width  of  the  key  for  R  in  bytes 

L  width  of  a  tuple  in  bytes 

P  page  size  in  bytes 
4  size  of  a  pointer  in  bytes 

We  have  analyzed  two  cases  of  interest.  The  first  is  the  cost  of  retrieving  a  single  tuple  using  a 
random  key  value.  An  example  of  this  type  of  query  is: 
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retrieve  (emp. salary)  where  emp.name  =  "Jones” 

The  second  case  analyzed  is  the  cost  of  reading  N  records  sequentially.  Consider  the  query 
retrieve  (emp. salary,  emp.name)  where  emp.name  =  "J»” 
which  requests  data  on  all  employees  whose  names  begin  with  J.  To  execute  this  query,  the  database  sys¬ 
tem  would  locate  the  first  employee  with  a  name  beginning  with  J  and  then  read  sequentially.  This 
second  case  analyzes  the  sequential  access  portion  of  such  a  command. 

For  both  cases  (random  and  sequential  access),  there  are  two  costs  that  are  specific  to  the 
access  method: 

|page  reads|  the  number  of  pages  read  to  execute  the  query 

|comparisonsf  the  number  of  record  comparisons  required  to 
isolate  the  particular  data  of  interest. 

The  number  of  comparisons  is  indicative  of  the  CPU  time  required  to  process  the  command  while  the 
number  of  page  reads  approximates  the  I/O  costs. 

To  compare  the  performance  AVL  and  B+  -trees,  we  propose  the  following  cost  function: 
cost  =  Z  »  |page-reads|  +  (comparisons! 

Since  a  page  read  consumes  perhaps  2000  instructions  of  operating  system  overhead  and  30  milliseconds  of 
elapsed  time  while  a  comparison  can  easily  be  done  in  200,  we  expect  realistic  values  of  Z  to  be  in  the 
rauge  of  10  to  30.  Later  in  the  section  we  will  use  several  vaiues  in  this  range. 

Moreover,  it  is  possible  (although  not  very  likely)  that  an  AVL-tree  comparison  will  be  cheaper 
than  a  B-b-tree  comparison.  The  reasoning  is  that  the  B+-tree  record  must  be  located  within  a  page 
while  an  AVL  tree  does  not  contain  any  page  structure  and  records  can  be  directly  located.  Conse¬ 
quently,  we  assume  that  an  AVL-tree  comparison  costs  Y  times  a  B+ -tree  comparison  for  some  Y  <  1. 
From  Knuth  [KNL'T73|,  we  can  observe  that  in  an  |(R||-tuple  AVL  tree  approximately 
C  =  logj||f?l|  +  0.25  comparisons 

are  required  to  6nd  a  tuple  in  a  relation.  Without  any  special  precautions  each  of  the  C  nodes  to  be 
inspected  will  be  on  a  different  page.1  Hence,  the  number  of  pages  accessed  is  approximately  C.  The 
1  If  a  paged  binary  tree  orgimutioa  is  iired  instead.  the  fanout  per  node  will  be  slightly  worse  than  the  B-tree  Furthermore. 


A. 
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AVL  structure  will  occupy  approximately 


Here  f,Y  1  denotes  the  smallest  integer  larger  than  X.  If  |M|  pages  of  main  memory  are  available,  and  if 
|  M  |  <  |  5  | ,  and  if  a  random  replacement  algorithm  is  used,  the  number  of  page  faults  to  find  a  tuple  in 


a  relation  will  be  approximately. 


faults  =  C  * 


Consequently  the  cost  of  a  random  access  by  Itey  is: 

co»t[AVL)  =  2.C*U-iyi-)  +  Y*C 

Next  we  derive  the  approximate  cost  for  a  random  access  to  a  tuple  using  a  B+  -tree.  Accord¬ 
ing  to  YAO  [YA078],  B-tree  nodes  are  approximately  69  percent  full  on  the  average.  Hence,  the  fanout  of 
a  B+  -tree  is  approximately 


The  number  of  leaf  nodes  will  be  about 


°  “  'Sf  data  page9 


The  height  of  a  B+  -tree  index  is  thereby 


height  — 


The  number  of  comparisons  required  to  locate  a  tuple  with  a  particular  value  is: 


C  '  =  fofc||K||] 


The  number  of  pages  which  the  tree  consumes  is  about 


5  '  =  D  + 


£_]  MM  MM[£ 

A  A  A  AAA 


To  a  first  approximation  S'  is 


S'  =  D  • 


=  D  V  — 

■— I  A 

T  —  0 


Again  the  number  of  page  faults  is  approximately 

naged  binary  trees  are  not  balanced  and  the  wont  case  access  time  may  be  significantly  poorer  than  in  the  case  of  a  B-tree 
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faults  =  ( height I)  *  (1 — ^TT^~) 

As  a  result  the  cost  of  a  B+-tree  access  by  key  is. 

cost(B+  -tree)  =  Z*(height  +  l)e(l  — ^;L)-r-  C  ' 

An  AVL-Tree  will  be  the  preferred  structure  for  case  1  if 

D1FF  =  cost(B+  -tree)  -  cost(AVL-Tree)  >  0 

If  we  assume  that  C  =  C  '  =  logj||T||  and  rearrange  the  terms  in  the  inequality,  then  an  AVL-Tree  will 
be  preferred  if: 

|l->')»log2||fl||  >  2‘logj)|«||»(l-iyJ-)- Z*(A«?A<+l)*(l--lj^-) 

Note  that  if  L  >>  8  then  S  =  0.69  •  S'.  Define  H  =•  ■  Some  simplification  yields: 

I  M  |  Z>n-H)+  l  -l  t 

S  Z  1.45 

Obviously,  if  |  M  \  >S,  then  AVL  trees  are  the  preferred  structure  regardless  of  the  values  of  H,  Y,  and  Z. 
In  this  situation,  the  entire  AVL-Tree  is  resident  in  main  memory  and  there  are  no  disk  accesses.  Since 
both  data  structures  require  the  same  aumber  of  comparisons  and  the  AVL  comparisons  are  cheapers. 
then  the  AVL-Tree  is  guaranteed  to  have  lower  cost.  If  J  M  |  <5  then  AVL  trees  will  be  preferred  if  the 

value  of  ■  I  ■  is  larger  than  the  value  of  min(|M|/S)  shown  as  in  Table  1.  .As  can  be  seen,  essentially  all 
of  a  relation  has  to  be  resident  in  main  memory  before  an  AVL  tree  is  the  preferred  structure  For  rea- 


Table  1  —  Minimum  Residency  Factor  For  Random  Access 


z 

Y 

H 

min  (|M|/S) 

10 

.5 

1 

.91 

10 

.5 

.87 

10 

.5 

.3 

82 

10 

.75 

.1 

.94 

10 

.75 

o 

.90 

10 

.75 

.3 

.86 

15 

75 

.1 

96 

15 

.75 

.2 

91 

15 

.75 

.3 

.86 
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3.1.  Introduction 

In  this  section  we  explore  the  performance  of  alternative  algorithms  for  relational  database 
operations  in  an  environment  with  very  large  amounts  of  main  memory.  Since  many  of  the  techniques 
used  for  executing  the  relational  join  operator  can  also  be  used  for  other  relational  operators  (e  g.  aggre¬ 
gate  functions,  cross  product,  and  division),  our  evaluation  efforts  have  concentrated  on  the  join  opera¬ 
tion.  However,  at  the  end  of  the  section,  we  discuss  how  our  results  extend  to  these  other  algorithms. 

After  introducing  the  notation  used  in  our  analysis,  we  present  an  analysis  of  the  familiar 
sort-merge  |BLAS77]  join  algorithm  using  this  notation.  Next  we  analyze  a  multipass  extension  of  the 
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simple  hashing  algorithm.  The  third  algorithm  described  is  an  algorithm  that  has  been  proposed  by  the 
Japanese  5th  generation  project  (KITS83],  and  is  called  GRACE  In  the  first  phase,  the  join  of  two  large 
relations  is  reduced  to  the  join  of  several  small  sets  of  tuples.  During  the  second  phase,  the  tuple  sets  are 
joined  using  a  hardware  sorter  and  a  sort-merge  algorithm.  Finally,  we  present  a  new  algorithm,  called 
the  Hybrid  algorithm.  This  algorithm  is  similar  to  the  GRACE  algorithm  in  that  it  partitions  a  join  into 
a  set  of  smaller  joins.  However,  during  the  second  phase,  hashing  is  used  instead  of  sort  merge. 

In  the  following  sections  we  develop  cost  formulas  for  each  of  the  four  algorithms  and  report 
the  result  of  analytic  simulations  of  the  four  algorithms.  Our  results  indicate  that  that  the  Hybrid  algo¬ 
rithm  is  preferable  to  all  others  over  a  large  range  of  parameter  values. 

3.2.  Notation  and  Assumptions 

Let  R  and  S  be  the  two  relations  to  be  joined.  The  number  of  pages  in  these  two  relations  is 
denoted  |R|  and  |S|,  respectively.  The  number  of  tuples  in  R  and  S  are  represented  by  ||R||  and  ||S||.  The 
number  of  pages  of  main  memory  available  to  perform  the  join  operation  is  denoted  as  |M|  Given  |M| 
pages  of  main  memory.  {,W}#,  {M)s  specify  the  number  of  tuples  from  R  and  S  that  can  fit  in  mam 
memory  at  one  time. 


We  have  used  the  following  parameters  to  characterize  the  performance  of  the  computer  sys- 


tern  used. 

comp 

time  to  compare  keys 

hash 

time  to  hash  a  key 

move 

time  to  move  a  tuple 

swap 

time  to  swap  two  tuples 

IOseq 

time  to  perform  a  sequential  10  operation 

IOrasd 

time  to  execute  a  random  IO  operation 

To  simplify 

our  analysis  we  have  made  a  number  of  assumptions.  First,  we  have  assumed 

that 

|/?|<|Sj. 

Next,  several  quantities  need  to  be  incremented  by  slight  amounts  to  be  accurate. 

For 

example,  a  hash  table  or  a  sort  structure  to  hold  R  requires  somewhat  more  pages  than  |R|,  and  finding  a 
key  value  in  a  bash  table  requires,  on  the  average,  somewhat  more  than  one  probe  We  use  "F"  to  denote 
any  and  all  of  these  increments,  so  for  example  a  hash  table  to  hold  R  will  require  |R|*F  pages  To  sim¬ 
plify  cost  calculations,  we  have  assumed  no  overlap  of  CPU  and  10  processing  We  have  also  ignored  the 
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cost  of  reading  the  relations  initially  and  the  co6t  of  writing  the  result  of  the  join  to  disk  since  these  costs 
are  the  same  for  each  algorithm. 

In  any  sorting  or  hashing  algorithm,  the  implementor  must  make  a  decision  as  to  whether  the 
sort  structure  or  hash  table  will  contain  entire  tuples  or  only  Tuple  IDs  (TIDs)  and  perhaps  keys.  If  only 
TIDs  or  TID-key  pairs  are  used,  there  is  a  significant  space  savings  since  fewer  bytes  need  to  be  manipu¬ 
lated.  On  the  other  hand,  every  time  a  pair  of  joined  tuples  is  output,  the  original  tuples  must  be 
retrieved.  Since  these  tuples  will  most  likely  reside  on  disk,  the  cost  of  the  random  accesses  to  retrieve 
the  tuples  can  exceed  the  savings  of  using  TIDs  if  the  join  produces  a  large  number  of  tuples.  For¬ 
tunately  we  can  avoid  making  a  choice  as  the  decision  affects  our  algorithms  only  in  the  values  assigned 
to  certain  parameters.  For  example,  if  only  TID-key  pairs  are  used  then  the  parameter  measuring  the 
time  for  a  move  will  be  smaller  than  if  entire  tuples  are  manipulated. 

Three  algorithms  (Sort-merge,  GRACE,  and  Hybrid  hash)  are  much  easier  to  describe  if  they 
require  at  most  two  passes.  Hence  we  assume  the  necessary  condition  J |  5  |  »F  <  |  M  \ .  For  example,  if  F 
=  1.2.  then  |M|  is  only  1,000  pages  (4  megabytes  at  4K  bytes/page),  and  |S|  (and  |R|,  since  |  R  |  <  |  5  |  ) 
can  be  as  large  as  800,000  pages  (3.2  gigabytes)! 

3J.  Partitioning  a  Relation  by  Hash  Values 

If  |M|  <  |R|*F,  ea^h  of  the  hashing  algorithms  described  in  this  paper  requires  that  R  and/or 
S  be  partitioned  into  distinct  subsets  such  that  any  two  tuples  which  hash  to  the  same  value  lie  in  the 
same  subset  One  such  partitioning  is  into  the  sets  R,  such  that  Rs  contains  those  tuples  r  for  which  b(r| 
=  x.  We  call  such  a  partition  compatible  with  h 

A  general  way  to  create  a  partition  of  R  compatible  with  h  is  to  partition  the  set  of  hash 

values  X  that  h  can  assume  into  subsets,  say  A',. ...  ,Y«.  Then,  for  i  =  1 n  define  R,  to  be  all  tuples  r 

such  that  h(r|  is  in  A',.  In  fact,  every  partition  of  R  compatible  with  h  can  be  derived  in  this  general  way, 
beginning  with  a  partition  of  the  hash  values.  The  power  of  this  method  is  that  if  we  partition  both  R  and 
S  using  the  same  h  and  the  same  partition  of  hash  values,  say  into  /?, R ,  3nd  5,,.  ,5„,  then  the  prob¬ 
lem  of  joining  R  and  S  is  reduced  to  the  task  of  joining  Rx  with  5^  R2  with  S2,  etc.  (BABBF9. 
GOOD8IJ 
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In  order  for  the  hash  table  of  each  set  of  R  tuples  to  fit  in  memory,  |  ft,  |  *F  must  be  <  |  .U  |  . 
This  is  not  easily  guaranteed.  For  example,  how  can  one  choose  a  partition  of  R,  compatible  with  h,  ioto 
two  sets  of  equal  size?  One  might  try  trial  and  error  Begin  by  partitioning  the  set  of  hash  values  into 
two  sets  -V[  and  A'-  of  equal  size  and  then  consider  the  sizes  of  the  two  corresponding  sets  of  tuples  /?, 
and  R2-  If  the  R-sets  are  not  of  equal  size  then  one  changes  the  X  sets  to  compensate,  check  the  new  R- 
sets  again,  etc.  Despite  the  apparent  difficulties  of  selecting  the  sets  Jflt  X2,  ....  there  are  two  mitigating 
circumstances  Suppose  that  the  key  distribution  has  a  bounded  density  and  that  the  hash  function 
effectively  randomizes  the  keys.  If  the  number  of  keys  in  each  partition  is  large,  then  the  central  limit 
theorem  assures  us  that  the  relative  variation  in  the  number  of  keys  (and  hence  the  number  of  tuples|  in 
each  partition  will  be  small.  Furthermore,  if  we  err  slightly  we  can  always  apply  the  hybrid  hash  join 
recursively,  thereby  adding  an  extra  pass  for  the  overflow  tuples. 


3.4.  Sort-Merge  Join  Algorithm 

The  standard  sort-merge  algorithm  begins  by  producing  sorted  runs  of  tuples  which  are  on  the 
average  twice  as  long  as  the  number  of  tuples  that  can  fit  into  a  priority  queue  in  memory  [KNUT73| 
This  requires  one  pass  over  each  relation.  During  the  second  phase,  the  runs  are  merged  using  an  n-way 
merge,  where  n  is  as  large  as  possible  (since  only  one  output  page  is  needed  for  each  run,  n  can  be  equal 
to  |M|-1)  If  n  is  less  t^-an  the  number  of  runs  produced  by  the  first  phase,  more  than  two  phases  will  be 
needed.  Our  assumptions  guarantee  that  only  two  phases  are  needed 
The  steps  of  the  sort-merge  join  algorithm  are: 


( 1|  Scan  S  and  produce  output  runs  using  a  selection  tree  or  some  other  priority  queue  structure  Do  the 

2»|.W| 


same  for  R  A  typical  run  will  be  approximately 


pages  long  [KNT'T73|.  Since  the  runs  of 


2»  I  ,Vf  ]  |  R  I  t[? 

R  have  an  average  length  of  - ^ — —  p3ges,  there  are  -J — 1  such  runs.  Similarly,  there  are 

r  2  *  Af 


1  ^ — —  runs  of  S.  Since  S  is  the  larger  relation,  the  total  number  of  runs  is  at  most  -L^-l — — . 

2*|.W|  |V  | 

Therefore,  all  the  runs  can  be  merged  at  once  if  |  M  |  >  1  f  J  f ,  or  |  M  |  >  \J  |  S  |  *F  ,  and  we  have 

M 


assumed  |M|  to  be  at  least  / 1  S  \  *F  pages  Thus  all  runs  can  be  merged  at  once. 


a 


(2)  Allocate  one  page  of  memory  for  buffer  .pace  for  each  run  of  R  and  S.  Merge  runs  from  R  and  S 
concurrently.  When  a  tuple  from  R  matches  one  from  S,  output  the  pair. 


"rhe  cost  of  this  algorithm  (ignoring  the  cost  of  reading  the  relations  initially  and  the  co6t  of  writing  the 
result  of  the  join)  is: 


IPHIog;  +  ||S  ||log2-^r^)  *  (comp+  swap) 

+  (|*|+ 

+  <1*1  +  |S|)'/CW> 

+  HI*lllog2  +  ll5lllo*2-^y^7')  *  (comP+swaP) 


Insert  tuples  into  priority  queue 
to  form  initial  runs 

write  initial  runs 

Reread  initial  runs 

Insert  tuples  into  priority  queue 
for  final  merge 


+  (li*ll+  II*  II)  *  comP 


Join  results  of  final  merge. 


This  cost  formula  holds  only  if  a  tuple  from  R  does  not  join  with  more  than  a  page  of  tuples  from  S. 


3.5.  Simple-Hash  Join  Algorithm 

If  a  hash  table  containing  all  of  R  fits  into  memory,  i.e.  if  |  R  \  *F  <  |  M  | ,  the  simple-hash 
join  algorithm  proceeds  as  follows:  build  a  hash  table  for  R  in  memory  and  then  scan  S,  hashing  each 
tuple  of  S  and  checking  for  a  match  with  R  (to  obtain  reasonable  performance  the  hash  table  for  R  should 
contain  at  least  TID-key  pairs).  If  the  bash  table  for  R  will  not  fit  in  memory,  the  simple-hash  join  algo¬ 
rithm  fills  memory  with  a  hash  table  for  part  of  R,  then  scans  S  against  that  hash  table,  then  it  continues 
with  another  part  of  R,  scans  the  remainder  of  S  again,  etc. 

The  steps  of  the  simple-hash  join  algorithm  are: 


(1)  Let  P  =  min(|M|,  |R|*F).  Choose  a  hash  function  h  and  3  range  of  hash  values  so  that  —  pages  of 

r 

R-tuples  will  bash  into  that  range.  Scan  the  (smaller)  relation  R  and  consider  each  tuple  If  the 
tuple  hashes  into  the  chosen  range,  insert  the  tuple  into  a  P-page  hash  table  in  memory.  Otherwise, 
write  the  tuple  into  a  new  file  on  disk. 


(2)  Scan  the  larger  relation  S  and  consider  each  tuple  If  the  tuple  hashes  into  the  chosen  range,  check 
the  hash  table  of  R-tuples  in  memory  for  a  match  and  output  the  pair  if  a  match  occurs.  Otherwise, 
write  the  tuple  to  disk.  Note  that  if  key  values  of  the  two  relations  3re  distributed  similarlv,  there 
p  I  5  I 

—  L  pages  of  the  larger  relation  S  processed  in  this  pass. 

I  *  I 


will  be 
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(3)  Repeat  steps  (1)  and  (2),  replacing  each  of  the  relations  R  and  S  by  the  set  of  tuples  from  R  and  S 
that  were  "passed  over’  and  written  to  disk  in  the  previous  pass.  The  algorithm  ends  when  no 
tuples  from  R  are  passed  over. 


The  algorithm  requires 


[RjjF 


I M  I 


passes  to  execute.  We  denote  this  quantity  by  A.  Also  note  that  on 


[M] 


the  ith  pass,  i  =  1 . A-l,  ||/?||  -  »  *  — ~ —  tuples  of  R  are  passed  over.  The  cost  of  the  algorithm  is: 


||R||  •  (hash  +  move) 

+  ||S||  *  (hash  +  comp»F) 

+  ((A-1)*||R||  -  —  — I  *  — )  •  (hash+move) 

+  ((A-1)»||5T ||  -  ^  *  ^~-)  •  (hash-t-  move) 

+  ((A-1)«|R  |  -  «  -1~L|  v  2*IOseq 

+  ((A-I)*|S|  -  ,1M±  .  J|L)  *2  •  io. 


JSEQ 


Place  each  tuple  of  R  in  a  hash  table 
Check  a  tuple  of  S  for  a  match. 

Hash  and  move  passed-over  tuples  in  R 

Hash  and  move  passed-over  tuples  io  S. 

Write  and  read  passed-over  tuples  in  R. 

Write  and  read  passed-over  tuples  in  S 


3.0.  GRACE-Hash  Join  Algorithm 

As  outlined  in  [KITS83|,  the  GRACE-basb  join  algorithm  executes  as  two  phases.  The  first 
phase  begins  by  choosing  an  h  and  partitioning  the  set  of  hash  values  for  h  into  |M|  sets,  corresponding  to 
a  partition  of  R  and  S  into  |M|  sets  each,  such  that  R  is  partitioned  into  sets  of  approximately  equal  size. 
No  assumptions  are  made  about  set  sizes  in  the  partition  of  S.  The  algorithm  uses  one  page  of  main 
memory  as  an  output  buffer  for  each  of  the  |M|  sets  in  the  partition  of  R  and  S.  During  the  second  phase 
of  the  algorithm,  the  join  is  performed  using  a  hardware  sorter  to  execute  a  sort-merge  algorithm  on  each 
pair  of  sets  in  the  partition.  To  provide  a  fair  comparison  between  the  different  algorithms,  we  have  used 
hashing  to  perform  the  join  during  the  second  phase.  The  algorithm  proceeds  as  follows: 

(1)  Scan  R.  Using  h,  hash  each  tuple  and  place  in  the  appropriate  output  buffer  When  an  output 

buffer  fills,  it  is  written  to  disk.  After  R  has  been  completely  scanned,  Bush  ail  output  buffers  to 

disk. 

(2)  Scan  S.  Using  h,  hash  each  tuple  and  place  in  the  appropriate  output  buffer  When  an  output 

buffer  fills,  it  is  written  to  disk.  After  S  has  been  completely  scanned.  Bush  all  output  buffers  to 

disk. 
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Steps  (3)  and  (4)  below  are  repeated  for  each  set  ft,,  1<«<|A/|,  in  the  partition  for  R.  and  its 
corresponding  set  5,. 


(3)  Read  ft ,  into  memory  and  build  a  hash  table  for  it. 


We  pause  to  check  that  a  hash  table  for  ft,  can  fit  in  memory .  Assuming  that  all  the  sets  ft,  are  of 
equal  size,  since  there  are  |M|  of  them,  |  ft,  |  will  equal  |~|  pages  The  inequality 
|  ft,  |  ’F  <  |  M  |  is  equivalent  to  V  |  ft  |  *F  <  |  M  | ,  and  we  have  assumed  that  y|  S  |  <  |  .Vf  | 


(4)  Hash  each  tuple  of  5,  with  the  same  hash  function  used  to  build  the  hash  table  in  (3)  Probe  for  a 
match.  If  there  is  one,  output  the  result  tuple,  otherwise  proceed  with  then  next  tuple  of  S, 


The  cost  of  this  algorithm  is: 

(llRll  +  l|S||)  *  (hash  +  move) 
+  (|R|  +  |S|)  *  10, ,AND 
+  (|R|  +  | S | )  •  IOseq 
+  | |R [|  •  (hash  +  move) 

+  ||S||  «  (hash  +  comp»F) 


Hash  tuple  and  move  to  output  buffer 
Write  partitioned  relations  to  disk 
Read  partitioned  sets 
Build  hash  tables  in  memory 
Probe  for  a  match 


3.7.  Hybrid-Hash  Join  Algorithm 

In  our  hybrid-hash  algorithm,  we  use  the  large  main  memory  to  minimize  disk  traffic.  On  the 

first  pass,  instead  of  using  all  of  memory  3S  a  buffer  as  is  done  in  the  GRACE  algorithm,  only  as  many 

pages  )B.  defined  below)  as  are  necessary  to  partition  R  into  sets  that  can  fit  in  memory  are  used  The 

rest  of  memory  is  used  for  a  hash  table  that  is  processed  at  the  same  time  that  R  and  S  are  being  parti¬ 

tioned. 


Let  B  =  moz(0,  -L^LL^lipLL).  There  will  be  B4-  1  steps  in  the  hybrid-hash  algorithm 

First,  choose  a  hash  function  h  and  partition  R  into  ft*  ,fts,  such  that  a  hash  table  for  ft0  has 
|  M  |  -B  pages,  and  are  of  equal  size. 

Before  describing  the  algorithm  we  first  show  that  a  hash  table  for  ft,  will  fit  into  memory 
Assuming  that  all  sets  ft,  are  of  equal  size,  we  denote  |ft,|  by  p.  We  nust  show  that: 

P‘F<\M\  (a) 

Since  fl„  is  chosen  so  that  a  hash  table  for  it  fits  into  |  ,Vf  |  -  B  pages  of  memory,  we  have 
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\Ro\#-\M\-B  (b) 

Since  the  sum  of  all  the  A, -sets  is  R,  we  have 

|ft|=fl*p+  |ft0|  (c) 

If  a  hash  table  for  all  of  R  Sts  into  memory,  we  can  choose  B  =  0  and  be  done  with  it.  So  henceforth  we 

assume  |  M  |  <  |  R  \  *F .  Thus,  B—  I  —  j 

1  M  1  -1 

we  get: 


.  If  we  solve  (c)  for  p  and  substitute  (b)  in  the  result 


, -  1*1 

p  B  B 

Now  we  multiply  (d)  by  F  and  simplify  to  get: 


l*J  liLLig 

B  F'B 


(d) 


Finally,  we  substitute  for  B  in  (e)  to  get  (a),  which  was  our  goal.  Thus  we  have  demonstrated  that  a  hash 
table  for  R,  fits  into  memory 


Now  we  continue  with  the  algorithm.  Allocate  B  pages  of  memory  to  output  buffer  space,  and 
assign  the  other  |  .Vf  |  -B  pages  of  memory  to  a  hash  table  for  R0.  We  pause  again  to  check  that  there  are 
enough  pages  in  memory  to  hold  the  output  buffers,  i.e.  that  B  <  |  M  | .  If  we  substitute  for  B  in  the  ine¬ 
quality  B<\M\  and  simplify,  we  get  ■/ 1  ft  ]  *F  <  |  M  | ,  which  is  true  since  we  have  assumed  that  that 

\/|5|  •F<\M[. 

The  steps  of  the  hybrid-hash  algorithm  are: 

(1)  Assign  the  ith  output  buffer  page  to  /?,  /or  i  =/,... ,B.  Scan  R.  Hash  each  tuple  with  b  If  it 
belongs  to  /?„,  place  it  in  memory  in  the  hash  table  for  Rg.  Otherwise  it  belongs  to  R,  for  some 
i  >0.  so  move  it  to  the  ith  output  buffer  page.  When  this  step  has  finished,  we  have  a  hash  table  for 
R0  in  memory,  and  R  | . Rg  are  on  disk  The  partition  of  R  corresponds  to  a  partition  of  S  compa¬ 

tible  with  h,  into  sets  So,-.,SB. 


(2)  Assign  the  ith  output  buffet  page  to  5,  for  i=l B  Scan  S,  hashing  each  tuple  with  h  If  the 

tuple  is  in  Sg,  probe  the  hash  table  in  memory  for  a  match.  If  there  is  a  match,  output  the  the 
result  tuple.  If  there  is  no  match,  toss  the  tuple.  Otherwise,  the  hashed  tuple  belongs  to  S,  for  some 
i>0,  so  move  it  to  the  ith  output  buffer  page.  Now  RX,...,RB  and  Si,...,SB  are  on  disk. 


Repeat  steps  (3)  and  (4)  for  i  =  1,...,B. 

(3)  Read  R ,  and  build  a  hash  table  for  it  in  memory.  We  have  already  shown  that  a  hash  table  for  it 
will  fit  in  memory 
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(4)  Scan  5,,  bashing  each  tuple,  and  probing  the  bash  table  for  R,,  which  is  in  memory  If  there  is  a 
match,  output  the  result  tuple,  otherwise  to6s  the  S  tuple. 


1 


For  the  cost  computation,  denote  by  q  the  quotient 


1*0 


1*1 


-,  namely  the  fraction  of  R  represented  by  R0. 


To  calculate  the  cost  of  this  join  we  need  to  know  the  size  of  S0,  and  we  estimate  it  to  be  q*|S|.  Then  the 
fraction  of  R  and  S  sets  remaining  on  the  disk  is  i-q.  The  cost  of  the  hybrid-hash  join  is: 


(||R||  +  ||S||)*hash 

+  (l|R||+llsll)*(i-d)*mov* 

+  (|R|+  |S|)*(l-q  )*/0/tAfw 
+  <I|R||+  ||S||)«(l-q)*hash 
+  (|S||*F*comp 
+  ||R||»move 

+  (|R|+  ISIWi-qWOsEQ 


Partition  R  and  S 

Move  tuples  to  output  buffers 

Write  from  output  buffers 

Build  hash  tables  for  R  and  find  where  to  probe  for  S 

Probe  for  each  tuple  of  S 

Move  tuples  to  hash  tables  for  R 

Read  sets  into  memory 


3.8.  Comparison  of  the  4  Join  Algorithms 

In  Figure  1  we  have  displayed  the  relative  performance  of  the  four  join  algorithms.  The  verti¬ 
cal  axis  is  execution  time  in  seconds.  The  horizontal  axis  iB  the  ratio  of  ^ .  Note  that  above  a  ratio 

of  1.0  all  algorithms  have  the  same  execution  time  as  at  1.0,  except  that  sort-merge  will  improve  to 
approximately  900  seconds,  since  fewer  IO  operations  3re  needed.  The  parameter  settings  used  are  shown 
in  Table  2.  We  have  assumed  that  there  are  at  least  J  |  5  |  *F  pages  in  memory.  For  the  values  specified 

|  \i  j 

in  Table  2,  this  corresponds  to  t-‘_  .■*■„  =  0.009. 

|  R  |  *F 

In  generating  these  graphs  we  have  used  the  cost  formulas  given  above  with  one  exception. 
The  IOram  term  used  in  the  cost  formula  for  hybrid  hash  should  be  replaced  by  IOSeq  in  the  case  that 

I  R  I  *F 

there  is  only  one  output  buffer.  There  is  only  one  output  buffer  whenever  |  M  |  >  -1  -J-  (0.5  on  the 

horizontal  axis  of  Figure  1).  The  abrupt  discontinuity  in  the  performance  of  the  hybrid  hash  algorithm  at 
0.5  occurs  because  when  memory  space  decreases  slightly,  changing  the  number  of  output  buffers  from  one 
to  two,  the  IO  time  is  suddenly  calculated  as  a  multiple  of  lORAND  instead  of  IOSEQ.  Even  when  there 
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Table  2  —  Parameter  Settings  Used 


comp 

time  to  compare  keys 

3  microseconds 

hash 

time  to  hash  a  key 

9  microseconds 

move 

time  to  move  a  tuple 

20  microseconds 

swap 

time  to  swap  two  tuples 

60  microseconds 

IOseq 

sequential  10  operation  time 

10  milliseconds 

10 fi  AND 

random  10  operation  time 

23  milliseconds 

F 

universal  "fudge"  factor 

1.2 

|s| 

size  of  S  relation 

10,000  pages 

|H| 

size  of  R  relation 

10,000  pages 

l|R||/|R| 

number  of  R  tuples/page 

40 

!|S|1/|S| 

number  of  S  tuples/page 

40 

ers,  /OflaND 

is  probably  too  large  a  figure 

to  use  to  measure 

have  not  made  that  change.  This  is  what  causes  our  graphs  to  indicate  that  simple  bash  will  outperform 
hybrid  hash  in  a  small  region;  in  practice  hybrid  hash  will  probably  always  outperform  simple  hash 

We  have  generated  similar  graphs  for  the  range  of  parameter  values  shown  in  Table  3  For 
each  of  these  values  we  observed  the  same  qualitative  shape  and  relative  positioning  of  the  different  algo¬ 


rithms  as  shown  in  Figure  f.  Thus  oar  conclusions  do  not  appear  to  depend  on  the  particular  parameter 
values  that  we  have  chosen. 


3.0.  Algorithms  for  Other  Relational  Operations 

While  we  have  not  analyzed  algorithms  for  the  remaining  relational  operations  such  as  aggre¬ 
gate  function  and  projection  with  duplication  elimination,  we  can  offer  the  following  observations.  For 
aggregate  functions  in  which  related  tuples  must  be  grouped  together  (compute  average  employee  salary 

Table  3  -  Other  Parameter  Settings  Tested 


comp 

1  to  10  microseconds 

hash 

2  to  50  microseconds 

move 

10  to  50  microseconds 

swap 

20  to  250  microseconds 

IOseq 

5  to  10  milliseconds 

I  Off, (,VD 

15  to  35  milliseconds 

F 

1.0  to  1.4 

|S| 

10,000  to  200,000  pages 

IIRII 

100,000  to  1,000,000  tuples 
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by  manager),  the  result  relation  will  contain  one  tuple  per  group.  If  there  is  enough  memory  to  hold  the 
result  relation,  then  the  fastest  algorithm  will  be  a  one  pass  hashing  algorithm  in  which  each  incoming 
tuple  is  hashed  on  the  grouping  attribute.  If  there  is  not  enough  memory  to  hold  the  result  relation  (prob¬ 
ably  a  very  unlikely  event  as  who  would  ever  want  to  read  even  a  4  million  byte  report)  then  a  variant  of 
the  hybnd-hash  algorithm  described  for  the  join  operator  appears  fastest.  This  same  hybrid-bash  algo¬ 
rithm  appears  to  be  the  algorithm  of  choice  for  the  projection  operator  as  projection  with  duplicate  elimi¬ 
nation  is  very  similar  in  nature  to  the  aggregate  function  operation  (in  projection  we  are  grouping  identi¬ 
cal  tuples  while  in  an  aggregate  function  operation  we  are  grouping  tuples  with  an  identical  partitioning 
attribute). 

4.  Access  Planning  and  Query  Optimisation 

In  the  classic  paper  on  access  path  selection  by  Selinger  (SELI79),  techniques  are  developed  by 
choosing  the  "best"  processing  strategy  for  a  query.  "Best"  is  defined  to  be  the  plan  that  minimizes  the 
function  W*|CPU|  +  |I/0|  where  |CPU|  is  the  amount  of  CPU  time  consumed  by  a  plan,  (I/O)  is  the 
number  of  I/O  operations  required  for  a  plan,  and  W  is  a  weighting  factor  between  CPU  and  I/O 
resources.  Choosing  a  "best"  plan  involves  enumerating  all  possible  "interesting"  orderings  of  the  opera¬ 
tors  in  query,  all  alternative  algorithms  for  each  operator,  and  ail  alternative  access  paths.  The  process  is 
complicated  by  the  fact  the  order  in  which  tuples  are  produced  by  an  operator  can  have  a  significant 
effect  on  the  execution  time  of  the  subsequent  operator  in  the  query  tree. 

The  analysis  presented  in  Section  3  indicates  that  algorithms  based  on  hashing  (the  hybrid- 
hash  algorithm  in  the  C3se  of  the  join  operator  3nd  the  simple-hash  algorithm  to  process  projection  and 
aggregate  function  operators)  are  the  fastest  algorithms  when  a  large  amount  of  primary  memory  is  avail¬ 
able.  Since  the  performance  of  these  algorithms  is  not  affected  by  the  input  order  of  the  tuples  and  since 
there  is  only  one  algorithm  to  choose  from,  query  optimization  is  reduced  to  simply  ordering  the  operators 
so  that  the  most  selective  operations  are  pushed  towards  the  bottom  of  the  query  tree. 
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S.  Recovery  In  Large  Memory  Databases 

5.1.  Introduction  and  Assumptions 

High  transaction  processing  rates  can  be  obtained  on  a  processor  with  a  large  amount  of  main 
memory,  since  input/output  delays  can  be  significantly  reduced  by  keeping  the  database  resident  in 
memory  For  example,  if  the  entire  database  is  resident  in  memory,  a  transaction  would  never  need  to 
access  data  pages  on  disk. 

However,  keeping  a  large  portion  of  the  database  in  volatile  memory  presents  some  unique 
challenges  to  the  recovery  subsystem.  The  in-memory  version  of  the  database  may  differ  significantly 
from  its  latest  snapshot  on  disk.  A  simple  recovery  scheme  would  proceed  by  first  reloading  the  snapshot 
on  disk,  and  then  applying  the  transaction  log  to  bring  it  up  to  date.  Unless  the  recovery  system  does 
more  than  simple  logging  during  normal  transaction  processing,  recovery  times  would  become  intolerably 
long  using  this  approach. 

Throughout  this  section,  we  will  assume  that  the  entire  database  fits  in  main  memory  In  such 
an  environment,  we  need  only  be  concerned  with  log  writes.  A  "typical"  transaction  writes  400  bytes  of 
log  data  (40  bytes  for  transaction  begin/end,  360  bytes  for  old  values/new  values),2  which  takes  10  ms 
(time  to  write  one  4096  byte  page  without  a  disk  seek).  We  also  assume  that  a  small  portion  of  memory 
can  be  made  stable  by  providing  it  with  a  back-up  battery  power  supply. 

5.2.  Limits  to  Transaction  Throughput 

In  conventional  logging  schemes,  a  transaction  cannot  commit  until  its  log  commit  record  has 
been  written  to  stable  storage.  Most  transactions  have  very  simple  application  logic,  and  perform  three  to 
four  page  reads  and  writes.  While  transactions  no  longer  need  to  read  or  write  data  pages  if  the  database 
is  memory  resident,  they  still  need  to  perform  at  least  one  log  I/O.  Assuming  a  single  log  device,  the  sys¬ 
tem  could  commit  at  most  100  transactions  per  second  (1  second  /  10  ras  per  commit  =  100  committed 
transactions  per  second).  The  time  to  write  the  log  becomes  the  major  bottleneck. 

A  scheme  that  amortizes  this  log  I/O  across  several  transactions  is  based  on  the  notion  of  a 

2  These  »re  b&Jlp&rk  estimate:,  baaed  on  the  dimple  banking  database  and  transactions  in  Jim  Gray.  'Notes  on  Database 
Operating  Systems.*  IBM  R J2188f 30001 ),  {February  23,  1978) 


pre-committed  transaction.  When  a  transaction  is  ready  to  complete,  the  transaction  management  system 
places  its  commit  record  in  the  log  buffer.  The  transaction  releases  all  locks  without  waiting  for  the  com¬ 
mit  record  to  he  written  to  disk.  The  transaction  is  delayed  from  committing  until  its  commit  record  actu¬ 
ally  appears  on  disk.  The  "user"  is  not  notified  that  the  transaction  has  committed  until  this  event  has 
occurred.5 

By  releasing  its  locks  before  it  commits,  other  transactions  can  read  the  pre-committed 
transaction's  dirty  data.  Call  these  dependent  transactions.  Reading  uncommitted  data  in  this  way  does 
not  lead  to  an  inconsistent  state  as  long  as  the  pre-committed  transaction  actually  commits  before  its 
dependent  transactions.  A  pre-committed  transaction  does  not  commit  only  if  the  system  crashes,  never 
because  of  a  user  or  system  induced  abort.  As  long  as  records  are  sequentially  added  to  the  log,  and  the 
pages  of  the  log  buffer  are  written  to  disk  in  sequence,  a  pre-committed  transaction  will  have  its  commit 
record  on  disk  before  its  dependent  transactions. 

The  transactions  with  commit  records  on  the  same  log  page  are  committed  as  a  group,  and  are 
called  the  commit  group.  A  single  log  I/O  is  incurred  to  commit  all  transactions  within  the  group.  The 
size  of  a  commit  group  depends  on  how  many  transactions  can  fit  their  logs  within  a  unit  of  log  write  (i.e. , 
a  log  buffer  page).  Assuming  the  "typical’  transaction,  we  could  have  up  to  ten  transactions  per  commit 
group.  The  transaction  throughput  can  be  increased  by  another  order  of  magnitude,  to  1000  transactions 
per  second  (1  second  /  10  ms  to  commit  10  transactions  =  1000  transactions  committed  per  second) 

The  throughput  can  be  further  increased  by  writing  more  than ’one  log  page  at  a  time,  by  par¬ 
titioning  the  log  across  several  devices.  Since  more  than  one  log  I/O  can  be  active  simultaneously,  the 
recovery  system  must  maintain  a  topological  ordering  among  the  log  pager,  so  the  commit  record  of  a 
dependent  transaction  is  not  written  to  disk  before  the  commit  record  of  its  associated  pre-committed 
transaction  The  roots  of  the  topological  lattice  can  be  written  to  disk  simultaneously 

To  maintain  the  ordering,  and  thus  the  serialization  of  the  transactions,  the  lock  table  of  the 
concurrency  control  component  must  be  extended.  Associated  with  each  lock  are  three  sets  of  transac¬ 
tions  active  transactions  that  currently  hold  the  lock,  transactions  that  are  waiting  to  be  granted  the 

3  TVs  notion  of  gronp  commits  appears  to  be  part  of  tbe  onwntten  database  folklore  Tbe  Systero-R  implementors  claim  to 
bave  implemented  it  To  oar  knowledge,  neither  tbe  idea  nor  tbe  implementation  details  bas  yet  appeared  in  pnnt 
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lock,  and  pre-committed  transactions  that  have  released  the  lock  but  have  not  yet  committed.  When  a 
transaction  is  granted  a  lock,  it  becomes  dependent  on  the  pre-committed  transactions  that  formerly  held 
the  lock.  The  dependency  list  is  maintained  in  the  transaction's  descriptor  in  the  active  transaction  table. 
When  a  transaction  becomes  pre-committed,  it  moves  from  the  holding  list  to  the  pre-committed  list  for 
all  of  its  locks  (we  assume  all  locks  are  held  until  pre-commit),  and  the  committed  transactions  in  its 
dependency  list  are  removed.  In  becoming  pre-committed,  the  transaction  joins  a  commit  group.  The 
commit  groups  of  the  remaining  transactions  in  its  dependency  list  are  added  to  those  on  which  its  com¬ 
mit  group  depends.  A  commit  group  cannot  be  written  to  disk,  and  thus  commit,  until  all  the  groups  it 
depends  on  have  previously  been  committed 

For  recovery  processing,  a  single  log  is  recreated  by  merging  the  log  fragments,  as  in  a  sort- 
merge  For  example,  to  roll  backwards  through  the  log,  the  most  recent  log  page  in  each  fragment  is 
examined.  The  page  with  the  most  recent  timestamp  is  processed  first,  it  is  replaced  by  the  next  page  in 
that  fragment,  and  the  most  recent  log  page  of  the  group  is  again  determined.  By  a  careful  buffering  stra¬ 
tegy.  the  reading  of  log  p3ges  from  different  fragments  can  be  overlapped,  thus  reducing  recovery  time. 

5  .3.  Checkpointing  the  Database 

An  approach  for  reducing  recovery  time  is  to  periodically  checkpoint  the  database  to  stable 
storage  [GRAY81].  Checkpointing  limits  recovery  activities  to  those  transactions  that  are  active  at  the 
checkpoint  or  who  have  begun  since  the  last  checkpoint.  System-R,  for  example,  takes  an  action  con¬ 
sistent  checkpoint,  during  which  no  storage  system  operations  may  be  in  progress  (a  transaction  consists 
of  several  such  actions,  which  correspond  roughly  to  logical  reads  and  writes  of  the  database).  Dirty 
buffer  pool  pages  are  forced  to  disk.  Since  the  database  is  assumed  to  be  large,  a  large  number  of  dirty 
pages  will  oeed  to  be  written  to  disk,  making  the  database  unavailable  for  an  intolerably  long  amount  of 
time.  Consider  the  case  of  1000  transactions  per  second,  two  dirty  pages  per  transaction,  and  30  seconds 
between  checkpoints.  In  the  worst  C3se.  60,000  pages  would  need  to  be  written  at  the  checkpoint! 

We  would  like  to  overlap  checkpoint  with  transaction  activity  Let  be  the  set  of  pages 

that  have  been  updated  since  the  last  checkpoint.  Once  a  checkpoint  begins,  transaction  activity  can  con¬ 
tinue  if  updates  to  pages  of  cause  new  in-memory  versions  to  be  created,  leaving  the  old  versions 
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available  to  be  written  to  disk.  A  checkpointed,  action  consistent  state  of  the  database  is  always  main¬ 
tained  on  disk.  At  a  checkpoint,  a  portion  of  the  state  is  replaced  by  Am,*,.  To  guarantee  that  the  state 
is  updated  ^carefully we  use  a  batch  update  approach  by  first  writing  these  pages  to  stable  storage.  We 
denote  the  batch  update  file  by  Adlst  If  the  system  crashes  while  the  disk  state  is  being  overwritten  from 
memory,  it  can  be  reconstructed  from  the  pages  in 

The  algorithm  proceeds  in  two  phases.  In  phase  1,  Awm  is  written  to  A*,*.  During  phase  2, 
the  pages  \n  Am,m  are  copied  to  their  original  locations  on  disk  For  the  algorithm  to  work,  we  must 
assume. 

(1)  Extra  disk  space  is  available  to  hold  A*st. 

(2)  Extra  memory  space  is  available  to  hold  Am,w 

(3)  No  dirty  page  is  ever  written  to  disk  except  during  a  checkpoint. 

Time  stamps  are  used  to  determine  membership  in  AWfW.  The  timestamp  T:p  indicates  when 
the  current  checkpoint  began,  or  is  zero  if  no  checkpoint  is  in  progress.  When  a  transaction  attempts  to 
update  a  page,  the  page's  timestamp  Tvagt  is  compared  to  T:p.  If  Tpatt  <  Tcp  and  the  page  is  dirty,  a 
new  version  of  the  page  is  created  and  the  in-core  page  table  points  to  the  new  page.  The  update  is 
applied  to  the  new  page.  The  page’s  timestamp  is  updated  to  reflect  the  latest  modification. 

To  obtain  an  action  consistent  state  for  the  checkpoint,  the  system  is  initially  quiesced  T  v  is 
set  to  the  current  time  clock  to  indicate  that  a  checkpoint  has  begun.  The  active  transaction  list  is  con¬ 
structed  for  later  inclusion  in  the  log.  Transaction  activity  can  now  resume,  since  the  old  versions  in 
A*,*  can  no  longer  be  updated  Memory  pages  who  are  dirty  and  for  which  Tpage  <  T:p  are  written  to 
A,(il  After  Ail5l  has  been  created,  a  begin  checkpoint  record  is  written  to  the  log  with  T:p  and  the  list  of 
active  transactions,  indicating  that  phase  l  of  checkpoint  is  complete.  The  pages  of  are  then  writ- 

ten  to  their  original  locations  on  disk,  making  the  disk  state  identical  to  the  in-memory  state  as  of  T  P 
An  end  checkpoint  record  is  written  to  the  log  to  indicate  the  completion  of  phase  2,  A*u  is  removed,  and 
T:p  is  reset. 

The  advantage  of  the  algorithm  is  that  checkpointing  can  K  done  in  parallel  with  transaction 
activity  while  maintaining  an  action  consistent  state  on  disk.  This  is  particularly  needed  in  a  high  update 
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transaction  environment,  which  can  generate  a  large  number  of  updated  pages  betwpt-n  checkpoints 
Further,  as  soon  as  a  checkpoint  completes,  another  can  commence  with  only  a  negligible  interruption  of 
service  Checkpointing  proceeds  at  the  maximum  rate  possible,  i.e.,  as  fast  as  pages  can  be  written  to 
disk,  thus  keeping  the  log  processing  time  to  a  minimum  during  recovery 

S.4.  Reducing  Log  SU« 

WbiJe  checkpointing  will  reduce  tbe  time  to  process  the  log,  by  reducing  the  necessary  redo 
activity  it  does  not  help  reduce  the  log  size.  Tbe  large  amount  of  real  memory  available  to  us  can  be 
used  to  reduce  the  log  size,  if  we  assume  that  a  portion  of  memory  can  be  made  stable  against  system 
power  failures.  For  example,  batteries  can  be  used  as  a  back-up  power  supply  for  low  power  CMOS 
memory  chips  We  further  assume  that  such  memory  is  too  expensive  to  be  used  for  all  of  real  memory 

Partition  real  memory  into  a  stable  portion  and  a  conventional,  non-stable  portion  The  stable 
portion  will  be  used  to  hold  an  in-memory  log,  which  can  be  viewed  as  a  reliable  disk  output  queue  for  log 
data  Transactions  commit  as  soon  as  they  write  their  commit  records  into  the  in-memory  log.  Log  data 
is  written  to  disk  as  soon  as  a  log  buffer  page  fills  up  Given  the  buffering  of  the  log  in  memory,  it  may  be 
more  efficient  to  write  the  log  a  track  at  a  time.  In  addition,  multiple  log  writes  can  be  directed  to 
different  log  devices  without  the  need  for  the  bookkeeping  described  above.  However,  in  the  steady  state, 
the  number  of  transactions  processed  per  second  is  still  limited  by  how  fast  we  can  empty  buffer  pages  bv 
writing  them  to  disk-based  stable  storage. 

Stable  memory  does  not  seem  to  gain  much  over  the  group  commit  mechanism.  However,  the 
log  ran  be  significantly  compressed  while  it  is  buffered  in  stable  memory  The  log  entries  for  a  particular 
transaction  a/e  of  the  form 

Begin  Transaction 
<Old  Value,  New  Value > 

<Old  Value,  New  Value > 

End  Transaction 

A  transaction  s  space  in  the  log  can  be  significantly  reduced  if  only  new  values  are  written  to  the  disk 
based  log  (approximately  half  of  the  size  of  the  log  stores  the  old  values  of  modified  data  —  this  is  only 
needed  if  the  transaction  must  be  undone)  This  is  advantageous  for  space  management,  and  also  reduces 
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the  recovery  time  by  shortening  the  log. 

In  the  conventional  approach,  log  entries  for  all  transactions  are  intermixed  in  the  log.  The 
log  manager  maintains  a  list  of  committed  transactions,  and  removes  their  old  value  entries  from  log 
pages  before  writing  them  to  disk  A  transaction  is  removed  from  the  list  as  soon  as  its  commit  record 
has  been  written  to  disk.  A  more  space  efficient  alternative  is  to  maintain  the  log  on  a  per  transaction 
basis  in  the  stable  memory  If  enough  space  can  be  set  aside  to  accommodate  the  logs  of  all  active  tran¬ 
sactions  then  only  new  values  of  committed  transactions  are  ever  written  to  disk. 

Stable  memory  also  assists  in  reducing  the  recovery  time.  To  recover  committed  updates  to 
pages  since  the  last  checkpoint,  the  recovery  system  needs  to  find  the  log  entry  of  the  oldest  update  that 
refers  to  an  uncheckpointed  page  A  table  C3D  be  placed  in  stable  memory  to  record  which  pages  have 
been  updated  since  their  last  checkpoint,  and  the  log  record  id  of  the  first  operation  that  updated  the 
page.  When  a  page  is  checkpointed  to  disk,  its  update  status  is  reset.  The  log  record  id  of  the  next 
update  on  the  page  is  entered  into  the  table.  The  oldest  entry  in  the  table  determines  the  point  in  the  log 
from  which  recovery  should  commence. 

6.  Conclusion*  and  Future  Research 

In  this  paper  we  have  examined  changes  to  the  organization  of  relational  database  manage¬ 
ment  systems  to  permit  effective  utilization  of  very  large  main  memories.  We  have  shown  that  the  B+- 
tree  access  method  will  remain  the  preferred  access  method  for  keyed  access  to  tuples  in  a  relation  unless 
more  that  80%  -  90°}.  of  the  database  ran  be  kept  in  main  memory.  We  have  also  evaluated  alternative 
algorithms  for  complex  relational  operators.  We  have  shown  that  once  the  size  of  main  memory  exceeds 
the  square  root  of  the  size  of  the  relations  being  processed,  that  the  fastest  algorithms  for  the  join,  projec¬ 
tion,  and  aggregate  operators  are  based  on  a  hashing  It  is  interesting  to  note  that  this  result  3lso  holds 
for  "small”  main  memories  and  "small”  databases.  Finally,  we  have  discussed  recovery  techniques  for 
memory-resident  databases. 


There  appear  to  be  a  number  of  promising  areas  for  future  research.  These  include  buffer 
management  strategies  (how  to  efficiently  manage  very  large  buffer  pools),  the  effect  of  virtual  memory  on 
query  processing  algorithms,  and  concurrency  control.  While  locking  is  generally  accepted  to  the  algo- 
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rithm  of  choice  for  disk  resident  databases,  a  versioning  mechanism  [REED83|  may  provide  superior  per 
formance  for  memory  resident  systems. 
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